In [ ]:
!pip install pandas
!pip install seaborn
!pip install matplotlib
Requirement already satisfied: pandas in ./.venv/lib/python3.11/site-packages (2.2.0)
Requirement already satisfied: numpy<2,>=1.23.2 in ./.venv/lib/python3.11/site-packages (from pandas) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in ./.venv/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.11/site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in ./.venv/lib/python3.11/site-packages (from pandas) (2024.1)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Collecting lxml
  Downloading lxml-5.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.5 kB)
Downloading lxml-5.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.1/8.1 MB 8.1 MB/s eta 0:00:0000:0100:01
Installing collected packages: lxml
Successfully installed lxml-5.1.0
Collecting lightgbm
  Downloading lightgbm-4.3.0-py3-none-manylinux_2_28_x86_64.whl.metadata (19 kB)
Requirement already satisfied: numpy in ./.venv/lib/python3.11/site-packages (from lightgbm) (1.26.4)
Collecting scipy (from lightgbm)
  Downloading scipy-1.12.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 60.4/60.4 kB 881.3 kB/s eta 0:00:00 0:00:01
Downloading lightgbm-4.3.0-py3-none-manylinux_2_28_x86_64.whl (3.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.1/3.1 MB 4.8 MB/s eta 0:00:00a 0:00:01
Downloading scipy-1.12.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (38.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 38.4/38.4 MB 1.1 MB/s eta 0:00:0000:0100:010m
Installing collected packages: scipy, lightgbm
Successfully installed lightgbm-4.3.0 scipy-1.12.0
Requirement already satisfied: pandas in ./.venv/lib/python3.11/site-packages (2.2.0)
Requirement already satisfied: numpy<2,>=1.23.2 in ./.venv/lib/python3.11/site-packages (from pandas) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in ./.venv/lib/python3.11/site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.11/site-packages (from pandas) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in ./.venv/lib/python3.11/site-packages (from pandas) (2024.1)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.11/site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Requirement already satisfied: numpy in ./.venv/lib/python3.11/site-packages (1.26.4)
Collecting seaborn
  Downloading seaborn-0.13.2-py3-none-any.whl.metadata (5.4 kB)
Requirement already satisfied: numpy!=1.24.0,>=1.20 in ./.venv/lib/python3.11/site-packages (from seaborn) (1.26.4)
Requirement already satisfied: pandas>=1.2 in ./.venv/lib/python3.11/site-packages (from seaborn) (2.2.0)
Collecting matplotlib!=3.6.1,>=3.4 (from seaborn)
  Downloading matplotlib-3.8.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.8 kB)
Collecting contourpy>=1.0.1 (from matplotlib!=3.6.1,>=3.4->seaborn)
  Downloading contourpy-1.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.8 kB)
Collecting cycler>=0.10 (from matplotlib!=3.6.1,>=3.4->seaborn)
  Downloading cycler-0.12.1-py3-none-any.whl.metadata (3.8 kB)
Collecting fonttools>=4.22.0 (from matplotlib!=3.6.1,>=3.4->seaborn)
  Downloading fonttools-4.49.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (159 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 159.1/159.1 kB 1.3 MB/s eta 0:00:00a 0:00:01
Collecting kiwisolver>=1.3.1 (from matplotlib!=3.6.1,>=3.4->seaborn)
  Downloading kiwisolver-1.4.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.4 kB)
Requirement already satisfied: packaging>=20.0 in ./.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (23.2)
Collecting pillow>=8 (from matplotlib!=3.6.1,>=3.4->seaborn)
  Downloading pillow-10.2.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (9.7 kB)
Collecting pyparsing>=2.3.1 (from matplotlib!=3.6.1,>=3.4->seaborn)
  Downloading pyparsing-3.1.1-py3-none-any.whl.metadata (5.1 kB)
Requirement already satisfied: python-dateutil>=2.7 in ./.venv/lib/python3.11/site-packages (from matplotlib!=3.6.1,>=3.4->seaborn) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.11/site-packages (from pandas>=1.2->seaborn) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in ./.venv/lib/python3.11/site-packages (from pandas>=1.2->seaborn) (2024.1)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib!=3.6.1,>=3.4->seaborn) (1.16.0)
Downloading seaborn-0.13.2-py3-none-any.whl (294 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 294.9/294.9 kB 21.9 MB/s eta 0:00:00
Downloading matplotlib-3.8.3-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.6 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.6/11.6 MB 2.8 MB/s eta 0:00:0000:0100:01
Downloading contourpy-1.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (313 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 313.4/313.4 kB 2.1 MB/s eta 0:00:00a 0:00:01m
Downloading cycler-0.12.1-py3-none-any.whl (8.3 kB)
Downloading fonttools-4.49.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.9/4.9 MB 2.6 MB/s eta 0:00:0000:0100:01m
Downloading kiwisolver-1.4.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.4/1.4 MB 1.4 MB/s eta 0:00:00a 0:00:01
Downloading pillow-10.2.0-cp311-cp311-manylinux_2_28_x86_64.whl (4.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.5/4.5 MB 748.2 kB/s eta 0:00:0000:0100:01
Downloading pyparsing-3.1.1-py3-none-any.whl (103 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 103.1/103.1 kB 1.3 MB/s eta 0:00:0000:010:01
Installing collected packages: pyparsing, pillow, kiwisolver, fonttools, cycler, contourpy, matplotlib, seaborn
Successfully installed contourpy-1.2.0 cycler-0.12.1 fonttools-4.49.0 kiwisolver-1.4.5 matplotlib-3.8.3 pillow-10.2.0 pyparsing-3.1.1 seaborn-0.13.2
Requirement already satisfied: matplotlib in ./.venv/lib/python3.11/site-packages (3.8.3)
Requirement already satisfied: contourpy>=1.0.1 in ./.venv/lib/python3.11/site-packages (from matplotlib) (1.2.0)
Requirement already satisfied: cycler>=0.10 in ./.venv/lib/python3.11/site-packages (from matplotlib) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in ./.venv/lib/python3.11/site-packages (from matplotlib) (4.49.0)
Requirement already satisfied: kiwisolver>=1.3.1 in ./.venv/lib/python3.11/site-packages (from matplotlib) (1.4.5)
Requirement already satisfied: numpy<2,>=1.21 in ./.venv/lib/python3.11/site-packages (from matplotlib) (1.26.4)
Requirement already satisfied: packaging>=20.0 in ./.venv/lib/python3.11/site-packages (from matplotlib) (23.2)
Requirement already satisfied: pillow>=8 in ./.venv/lib/python3.11/site-packages (from matplotlib) (10.2.0)
Requirement already satisfied: pyparsing>=2.3.1 in ./.venv/lib/python3.11/site-packages (from matplotlib) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in ./.venv/lib/python3.11/site-packages (from matplotlib) (2.8.2)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib) (1.16.0)
Collecting sklearn
  Downloading sklearn-0.0.post12.tar.gz (2.6 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... error
  error: subprocess-exited-with-error
  
  × Getting requirements to build wheel did not run successfully.
  │ exit code: 1
  ╰─> [15 lines of output]
      The 'sklearn' PyPI package is deprecated, use 'scikit-learn'
      rather than 'sklearn' for pip commands.
      
      Here is how to fix this error in the main use cases:
      - use 'pip install scikit-learn' rather than 'pip install sklearn'
      - replace 'sklearn' by 'scikit-learn' in your pip requirements files
        (requirements.txt, setup.py, setup.cfg, Pipfile, etc ...)
      - if the 'sklearn' package is used by one of your dependencies,
        it would be great if you take some time to track which package uses
        'sklearn' instead of 'scikit-learn' and report it to their issue tracker
      - as a last resort, set the environment variable
        SKLEARN_ALLOW_DEPRECATED_SKLEARN_PACKAGE_INSTALL=True to avoid this error
      
      More information is available at
      https://github.com/scikit-learn/sklearn-pypi-package
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> See above for output.

note: This error originates from a subprocess, and is likely not a problem with pip.
Collecting phik
  Downloading phik-0.12.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.6 kB)
Requirement already satisfied: numpy>=1.18.0 in ./.venv/lib/python3.11/site-packages (from phik) (1.26.4)
Requirement already satisfied: scipy>=1.5.2 in ./.venv/lib/python3.11/site-packages (from phik) (1.12.0)
Requirement already satisfied: pandas>=0.25.1 in ./.venv/lib/python3.11/site-packages (from phik) (2.2.0)
Requirement already satisfied: matplotlib>=2.2.3 in ./.venv/lib/python3.11/site-packages (from phik) (3.8.3)
Collecting joblib>=0.14.1 (from phik)
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Requirement already satisfied: contourpy>=1.0.1 in ./.venv/lib/python3.11/site-packages (from matplotlib>=2.2.3->phik) (1.2.0)
Requirement already satisfied: cycler>=0.10 in ./.venv/lib/python3.11/site-packages (from matplotlib>=2.2.3->phik) (0.12.1)
Requirement already satisfied: fonttools>=4.22.0 in ./.venv/lib/python3.11/site-packages (from matplotlib>=2.2.3->phik) (4.49.0)
Requirement already satisfied: kiwisolver>=1.3.1 in ./.venv/lib/python3.11/site-packages (from matplotlib>=2.2.3->phik) (1.4.5)
Requirement already satisfied: packaging>=20.0 in ./.venv/lib/python3.11/site-packages (from matplotlib>=2.2.3->phik) (23.2)
Requirement already satisfied: pillow>=8 in ./.venv/lib/python3.11/site-packages (from matplotlib>=2.2.3->phik) (10.2.0)
Requirement already satisfied: pyparsing>=2.3.1 in ./.venv/lib/python3.11/site-packages (from matplotlib>=2.2.3->phik) (3.1.1)
Requirement already satisfied: python-dateutil>=2.7 in ./.venv/lib/python3.11/site-packages (from matplotlib>=2.2.3->phik) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in ./.venv/lib/python3.11/site-packages (from pandas>=0.25.1->phik) (2024.1)
Requirement already satisfied: tzdata>=2022.7 in ./.venv/lib/python3.11/site-packages (from pandas>=0.25.1->phik) (2024.1)
Requirement already satisfied: six>=1.5 in ./.venv/lib/python3.11/site-packages (from python-dateutil>=2.7->matplotlib>=2.2.3->phik) (1.16.0)
Downloading phik-0.12.4-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (687 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 687.8/687.8 kB 2.6 MB/s eta 0:00:0000:0100:01
Downloading joblib-1.3.2-py3-none-any.whl (302 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 302.2/302.2 kB 20.3 MB/s eta 0:00:00
Installing collected packages: joblib, phik
Successfully installed joblib-1.3.2 phik-0.12.4
ERROR: Could not find a version that satisfies the requirement random (from versions: none)
ERROR: No matching distribution found for random
Collecting scikit-learn
  Downloading scikit_learn-1.4.1.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (11 kB)
Requirement already satisfied: numpy<2.0,>=1.19.5 in ./.venv/lib/python3.11/site-packages (from scikit-learn) (1.26.4)
Requirement already satisfied: scipy>=1.6.0 in ./.venv/lib/python3.11/site-packages (from scikit-learn) (1.12.0)
Requirement already satisfied: joblib>=1.2.0 in ./.venv/lib/python3.11/site-packages (from scikit-learn) (1.3.2)
Collecting threadpoolctl>=2.0.0 (from scikit-learn)
  Downloading threadpoolctl-3.3.0-py3-none-any.whl.metadata (13 kB)
Downloading scikit_learn-1.4.1.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.1/12.1 MB 7.9 MB/s eta 0:00:0000:0100:01
Downloading threadpoolctl-3.3.0-py3-none-any.whl (17 kB)
Installing collected packages: threadpoolctl, scikit-learn
Successfully installed scikit-learn-1.4.1.post1 threadpoolctl-3.3.0
Collecting tensorflow
  Downloading tensorflow-2.15.0.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.2 kB)
Collecting absl-py>=1.0.0 (from tensorflow)
  Downloading absl_py-2.1.0-py3-none-any.whl.metadata (2.3 kB)
Collecting astunparse>=1.6.0 (from tensorflow)
  Downloading astunparse-1.6.3-py2.py3-none-any.whl (12 kB)
Collecting flatbuffers>=23.5.26 (from tensorflow)
  Downloading flatbuffers-23.5.26-py2.py3-none-any.whl.metadata (850 bytes)
Collecting gast!=0.5.0,!=0.5.1,!=0.5.2,>=0.2.1 (from tensorflow)
  Downloading gast-0.5.4-py3-none-any.whl.metadata (1.3 kB)
Collecting google-pasta>=0.1.1 (from tensorflow)
  Downloading google_pasta-0.2.0-py3-none-any.whl (57 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 57.5/57.5 kB 552.4 kB/s eta 0:00:00:--:--
Collecting h5py>=2.9.0 (from tensorflow)
  Downloading h5py-3.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.5 kB)
Collecting libclang>=13.0.0 (from tensorflow)
  Downloading libclang-16.0.6-py2.py3-none-manylinux2010_x86_64.whl.metadata (5.2 kB)
Collecting ml-dtypes~=0.2.0 (from tensorflow)
  Downloading ml_dtypes-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)
Requirement already satisfied: numpy<2.0.0,>=1.23.5 in ./.venv/lib/python3.11/site-packages (from tensorflow) (1.26.4)
Collecting opt-einsum>=2.3.2 (from tensorflow)
  Downloading opt_einsum-3.3.0-py3-none-any.whl (65 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.5/65.5 kB 999.4 kB/s eta 0:00:00 0:00:01
Requirement already satisfied: packaging in ./.venv/lib/python3.11/site-packages (from tensorflow) (23.2)
Collecting protobuf!=4.21.0,!=4.21.1,!=4.21.2,!=4.21.3,!=4.21.4,!=4.21.5,<5.0.0dev,>=3.20.3 (from tensorflow)
  Downloading protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl.metadata (541 bytes)
Requirement already satisfied: setuptools in ./.venv/lib/python3.11/site-packages (from tensorflow) (65.5.0)
Requirement already satisfied: six>=1.12.0 in ./.venv/lib/python3.11/site-packages (from tensorflow) (1.16.0)
Collecting termcolor>=1.1.0 (from tensorflow)
  Downloading termcolor-2.4.0-py3-none-any.whl.metadata (6.1 kB)
Collecting typing-extensions>=3.6.6 (from tensorflow)
  Downloading typing_extensions-4.9.0-py3-none-any.whl.metadata (3.0 kB)
Collecting wrapt<1.15,>=1.11.0 (from tensorflow)
  Downloading wrapt-1.14.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Collecting tensorflow-io-gcs-filesystem>=0.23.1 (from tensorflow)
  Downloading tensorflow_io_gcs_filesystem-0.36.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (14 kB)
Collecting grpcio<2.0,>=1.24.3 (from tensorflow)
  Downloading grpcio-1.60.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.0 kB)
Collecting tensorboard<2.16,>=2.15 (from tensorflow)
  Downloading tensorboard-2.15.2-py3-none-any.whl.metadata (1.7 kB)
Collecting tensorflow-estimator<2.16,>=2.15.0 (from tensorflow)
  Downloading tensorflow_estimator-2.15.0-py2.py3-none-any.whl.metadata (1.3 kB)
Collecting keras<2.16,>=2.15.0 (from tensorflow)
  Downloading keras-2.15.0-py3-none-any.whl.metadata (2.4 kB)
Collecting wheel<1.0,>=0.23.0 (from astunparse>=1.6.0->tensorflow)
  Downloading wheel-0.42.0-py3-none-any.whl.metadata (2.2 kB)
Collecting google-auth<3,>=1.6.3 (from tensorboard<2.16,>=2.15->tensorflow)
  Downloading google_auth-2.28.0-py2.py3-none-any.whl.metadata (4.7 kB)
Collecting google-auth-oauthlib<2,>=0.5 (from tensorboard<2.16,>=2.15->tensorflow)
  Downloading google_auth_oauthlib-1.2.0-py2.py3-none-any.whl.metadata (2.7 kB)
Collecting markdown>=2.6.8 (from tensorboard<2.16,>=2.15->tensorflow)
  Downloading Markdown-3.5.2-py3-none-any.whl.metadata (7.0 kB)
Collecting requests<3,>=2.21.0 (from tensorboard<2.16,>=2.15->tensorflow)
  Downloading requests-2.31.0-py3-none-any.whl.metadata (4.6 kB)
Collecting tensorboard-data-server<0.8.0,>=0.7.0 (from tensorboard<2.16,>=2.15->tensorflow)
  Downloading tensorboard_data_server-0.7.2-py3-none-any.whl.metadata (1.1 kB)
Collecting werkzeug>=1.0.1 (from tensorboard<2.16,>=2.15->tensorflow)
  Downloading werkzeug-3.0.1-py3-none-any.whl.metadata (4.1 kB)
Collecting cachetools<6.0,>=2.0.0 (from google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow)
  Downloading cachetools-5.3.2-py3-none-any.whl.metadata (5.2 kB)
Collecting pyasn1-modules>=0.2.1 (from google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow)
  Downloading pyasn1_modules-0.3.0-py2.py3-none-any.whl.metadata (3.6 kB)
Collecting rsa<5,>=3.1.4 (from google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow)
  Downloading rsa-4.9-py3-none-any.whl (34 kB)
Collecting requests-oauthlib>=0.7.0 (from google-auth-oauthlib<2,>=0.5->tensorboard<2.16,>=2.15->tensorflow)
  Downloading requests_oauthlib-1.3.1-py2.py3-none-any.whl (23 kB)
Collecting charset-normalizer<4,>=2 (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow)
  Downloading charset_normalizer-3.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (33 kB)
Collecting idna<4,>=2.5 (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow)
  Downloading idna-3.6-py3-none-any.whl.metadata (9.9 kB)
Collecting urllib3<3,>=1.21.1 (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow)
  Downloading urllib3-2.2.1-py3-none-any.whl.metadata (6.4 kB)
Collecting certifi>=2017.4.17 (from requests<3,>=2.21.0->tensorboard<2.16,>=2.15->tensorflow)
  Downloading certifi-2024.2.2-py3-none-any.whl.metadata (2.2 kB)
Collecting MarkupSafe>=2.1.1 (from werkzeug>=1.0.1->tensorboard<2.16,>=2.15->tensorflow)
  Downloading MarkupSafe-2.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting pyasn1<0.6.0,>=0.4.6 (from pyasn1-modules>=0.2.1->google-auth<3,>=1.6.3->tensorboard<2.16,>=2.15->tensorflow)
  Downloading pyasn1-0.5.1-py2.py3-none-any.whl.metadata (8.6 kB)
Collecting oauthlib>=3.0.0 (from requests-oauthlib>=0.7.0->google-auth-oauthlib<2,>=0.5->tensorboard<2.16,>=2.15->tensorflow)
  Downloading oauthlib-3.2.2-py3-none-any.whl.metadata (7.5 kB)
Downloading tensorflow-2.15.0.post1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (475.3 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 475.3/475.3 MB 1.3 MB/s eta 0:00:0000:01m0:10m
Downloading absl_py-2.1.0-py3-none-any.whl (133 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 133.7/133.7 kB 2.3 MB/s eta 0:00:00 0:00:01
Downloading flatbuffers-23.5.26-py2.py3-none-any.whl (26 kB)
Downloading gast-0.5.4-py3-none-any.whl (19 kB)
Downloading grpcio-1.60.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.4 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.4/5.4 MB 1.7 MB/s eta 0:00:0000:0100:01
Downloading h5py-3.10.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 4.8/4.8 MB 1.2 MB/s eta 0:00:0000:0100:01
Downloading keras-2.15.0-py3-none-any.whl (1.7 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.7/1.7 MB 918.5 kB/s eta 0:00:00a 0:00:01
Downloading libclang-16.0.6-py2.py3-none-manylinux2010_x86_64.whl (22.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 22.9/22.9 MB 4.2 MB/s eta 0:00:0000:0100:01
Downloading ml_dtypes-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.0/1.0 MB 5.0 MB/s eta 0:00:00a 0:00:01m
Downloading protobuf-4.25.3-cp37-abi3-manylinux2014_x86_64.whl (294 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 294.6/294.6 kB 6.0 MB/s eta 0:00:00a 0:00:01
Downloading tensorboard-2.15.2-py3-none-any.whl (5.5 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.5/5.5 MB 5.2 MB/s eta 0:00:0000:0100:01
Downloading tensorflow_estimator-2.15.0-py2.py3-none-any.whl (441 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 442.0/442.0 kB 4.4 MB/s eta 0:00:00a 0:00:01m
Downloading tensorflow_io_gcs_filesystem-0.36.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (5.1 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 5.1/5.1 MB 3.8 MB/s eta 0:00:0000:0100:01m
Downloading termcolor-2.4.0-py3-none-any.whl (7.7 kB)
Downloading typing_extensions-4.9.0-py3-none-any.whl (32 kB)
Downloading wrapt-1.14.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (78 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 78.4/78.4 kB 5.9 MB/s eta 0:00:00
Downloading google_auth-2.28.0-py2.py3-none-any.whl (186 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 186.9/186.9 kB 4.3 MB/s eta 0:00:0000:01
Downloading google_auth_oauthlib-1.2.0-py2.py3-none-any.whl (24 kB)
Downloading Markdown-3.5.2-py3-none-any.whl (103 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 103.9/103.9 kB 16.4 MB/s eta 0:00:00
Downloading requests-2.31.0-py3-none-any.whl (62 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.6/62.6 kB 5.2 MB/s eta 0:00:00
Downloading tensorboard_data_server-0.7.2-py3-none-any.whl (2.4 kB)
Downloading werkzeug-3.0.1-py3-none-any.whl (226 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 226.7/226.7 kB 5.4 MB/s eta 0:00:0000:01
Downloading wheel-0.42.0-py3-none-any.whl (65 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.4/65.4 kB 6.8 MB/s eta 0:00:00
Downloading cachetools-5.3.2-py3-none-any.whl (9.3 kB)
Downloading certifi-2024.2.2-py3-none-any.whl (163 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 163.8/163.8 kB 5.3 MB/s eta 0:00:00
Downloading charset_normalizer-3.3.2-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (140 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 140.3/140.3 kB 6.8 MB/s eta 0:00:00
Downloading idna-3.6-py3-none-any.whl (61 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.6/61.6 kB 3.3 MB/s eta 0:00:00
Downloading MarkupSafe-2.1.5-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (28 kB)
Downloading pyasn1_modules-0.3.0-py2.py3-none-any.whl (181 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 181.3/181.3 kB 10.6 MB/s eta 0:00:00
Downloading urllib3-2.2.1-py3-none-any.whl (121 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 121.1/121.1 kB 2.5 MB/s eta 0:00:00a 0:00:01
Downloading oauthlib-3.2.2-py3-none-any.whl (151 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 151.7/151.7 kB 6.4 MB/s eta 0:00:00
Downloading pyasn1-0.5.1-py2.py3-none-any.whl (84 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 84.9/84.9 kB 6.7 MB/s eta 0:00:00
Installing collected packages: libclang, flatbuffers, wrapt, wheel, urllib3, typing-extensions, termcolor, tensorflow-io-gcs-filesystem, tensorflow-estimator, tensorboard-data-server, pyasn1, protobuf, opt-einsum, oauthlib, ml-dtypes, MarkupSafe, markdown, keras, idna, h5py, grpcio, google-pasta, gast, charset-normalizer, certifi, cachetools, absl-py, werkzeug, rsa, requests, pyasn1-modules, astunparse, requests-oauthlib, google-auth, google-auth-oauthlib, tensorboard, tensorflow
Successfully installed MarkupSafe-2.1.5 absl-py-2.1.0 astunparse-1.6.3 cachetools-5.3.2 certifi-2024.2.2 charset-normalizer-3.3.2 flatbuffers-23.5.26 gast-0.5.4 google-auth-2.28.0 google-auth-oauthlib-1.2.0 google-pasta-0.2.0 grpcio-1.60.1 h5py-3.10.0 idna-3.6 keras-2.15.0 libclang-16.0.6 markdown-3.5.2 ml-dtypes-0.2.0 oauthlib-3.2.2 opt-einsum-3.3.0 protobuf-4.25.3 pyasn1-0.5.1 pyasn1-modules-0.3.0 requests-2.31.0 requests-oauthlib-1.3.1 rsa-4.9 tensorboard-2.15.2 tensorboard-data-server-0.7.2 tensorflow-2.15.0.post1 tensorflow-estimator-2.15.0 tensorflow-io-gcs-filesystem-0.36.0 termcolor-2.4.0 typing-extensions-4.9.0 urllib3-2.2.1 werkzeug-3.0.1 wheel-0.42.0 wrapt-1.14.1
In [ ]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from phik.report import plot_correlation_matrix
In [ ]:
test = pd.read_csv('test.csv')
train = pd.read_csv('train.csv')
In [ ]:
test
Out[ ]:
session_id site1 time1 site2 time2 site3 time3 site4 time4 site5 ... site6 time6 site7 time7 site8 time8 site9 time9 site10 time10
0 1 29 2014-10-04 35 2014-10-04 22 2014-10-04 321 2014-10-04 23 ... 2211 2014-10-04 6730 2014-10-04 21 2014-10-04 44582 2014-10-04 15336 2014-10-04
1 2 782 2014-07-03 782 2014-07-03 782 2014-07-03 782 2014-07-03 782 ... 782 2014-07-03 782 2014-07-03 782 2014-07-03 782 2014-07-03 782 2014-07-03
2 3 55 2014-12-05 55 2014-12-05 55 2014-12-05 55 2014-12-05 55 ... 55 2014-12-05 55 2014-12-05 55 2014-12-05 1445 2014-12-05 1445 2014-12-05
3 4 1023 2014-11-04 1022 2014-11-04 50 2014-11-04 222 2014-11-04 202 ... 3374 2014-11-04 50 2014-11-04 48 2014-11-04 48 2014-11-04 3374 2014-11-04
4 5 301 2014-05-16 301 2014-05-16 301 2014-05-16 66 2014-05-16 67 ... 69 2014-05-16 70 2014-05-16 68 2014-05-16 71 2014-05-16 167 2014-05-16
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
82792 82793 812 2014-10-02 1039 2014-10-02 676 2014-10-02 3 2014-05-27 167 ... 45064 2014-05-27 45065 2014-05-27 384 2014-05-27 23 2014-05-27 3346 2014-05-27
82793 82794 300 2014-05-26 302 2014-05-26 302 2014-05-26 300 2014-05-26 300 ... 1222 2014-05-26 302 2014-05-26 1218 2014-05-26 1221 2014-05-26 1216 2014-05-26
82794 82795 29 2014-05-02 33 2014-05-02 35 2014-05-02 22 2014-05-02 37 ... 6779 2014-05-02 30 2014-05-02 21 2014-05-02 23 2014-05-02 6780 2014-05-02
82795 82796 5828 2014-05-03 23 2014-05-03 21 2014-05-03 804 2014-05-03 21 ... 3350 2014-05-03 23 2014-05-03 894 2014-05-03 21 2014-05-03 961 2014-05-03
82796 82797 21 2014-11-02 1098 2014-11-02 1098 2014-11-02 1098 2014-11-02 1098 ... 1098 2014-11-02 1098 2014-11-02 1098 2014-11-02 1098 2014-11-02 1098 2014-11-02

82797 rows × 21 columns

In [ ]:
train
In [ ]:
full_df = pd.concat([train.drop('target', axis=1), test])

2.1 Визуальный анализ данных¶

In [ ]:
# Напишем функцию, которая принимает на вход DataFrame, кодирует числовыми значениями категориальные признаки
# и возвращает обновленный DataFrame и сами кодировщики.
def number_encode_features(init_df):
    result = init_df.copy() # копируем нашу исходную таблицу
    encoders = {}
    for column in result.columns:
        if result.dtypes[column] == object: # np.object -- строковый тип / если тип столбца - строка, то нужно его закодировать
            encoders[column] = preprocessing.LabelEncoder() # для колонки column создаем кодировщик
            result[column] = encoders[column].fit_transform(result[column]) # применяем кодировщик к столбцу и перезаписываем столбец
    return result, encoders

encoded_full, encoders = number_encode_features(full_df) # Теперь encoded data содержит закодированные кат. признаки
encoded_full.head()
Out[ ]:
session_id site1 time1 site2 time2 site3 time3 site4 time4 site5 ... site6 time6 site7 time7 site8 time8 site9 time9 site10 time10
0 2 890 80 941 79 3847 79 941 78 942 ... 3846 78 3847 78 3846 78 1516 78 1518 78
1 3 14769 31 39 31 14768 31 14769 31 37 ... 39 31 14768 31 14768 31 14768 31 14768 31
2 4 782 106 782 105 782 104 782 103 782 ... 782 103 782 103 782 103 782 103 782 103
3 5 22 86 177 85 175 85 178 84 177 ... 178 84 175 84 177 84 177 84 178 84
4 6 570 96 21 95 570 94 21 93 21 ... 178 84 175 84 177 84 177 84 178 84

5 rows × 21 columns

будем использовать визализацию Корреляция признаков по этой визуализации можно увидеть как признаки зависимы друг от друга

In [ ]:
features_target = encoded_full
interval_cols = encoded_full
phik_overview = features_target.phik_matrix(interval_cols=interval_cols)

plot_correlation_matrix(phik_overview.values,
                        x_labels=phik_overview.columns,
                        y_labels=phik_overview.index,
                        vmin=0, vmax=1, color_map="Greens",
                        title="Корреляция признаков",
                        fontsize_factor=1.5,
                        figsize=(20, 10))
plt.tight_layout()
No description has been provided for this image

и по этой визуализации видно что время все время зависимо друг от друга, а сайты не сильно зависимы друг от друга.

Далее будем использовать визуализацию ящик с усами она может показать как данные зависимы и как они разбросанны

In [ ]:
sns_plot = sns.pairplot(full_df)
sns_plot.savefig('pairplot.png')
No description has been provided for this image

Из данной визуализации видно что данные ОЧЕНЬ очень сильно расбросанны

далее визулизация displot она может показать зависимость данных в другом виде

In [ ]:
sns.set_theme(style="darkgrid")
df = full_df
sns.displot(
    df,
    binwidth=3, height=3, facet_kws=dict(margin_titles=True),
)
Out[ ]:
<seaborn.axisgrid.FacetGrid at 0x7d8d60441790>
No description has been provided for this image

По данной визуализации видно сильно много по сравнению с другой визуализацией

2.2 Конструирование признаков (Feature Engineering)¶

In [ ]:
full_df
Out[ ]:
session_id site1 time1 site2 time2 site3 time3 site4 time4 site5 ... site6 time6 site7 time7 site8 time8 site9 time9 site10 time10
0 2 890 2014-02-22 941 2014-02-22 3847 2014-02-22 941 2014-02-22 942 ... 3846 2014-02-22 3847 2014-02-22 3846 2014-02-22 1516 2014-02-22 1518 2014-02-22
1 3 14769 2013-12-16 39 2013-12-16 14768 2013-12-16 14769 2013-12-16 37 ... 39 2013-12-16 14768 2013-12-16 14768 2013-12-16 14768 2013-12-16 14768 2013-12-16
2 4 782 2014-03-28 782 2014-03-28 782 2014-03-28 782 2014-03-28 782 ... 782 2014-03-28 782 2014-03-28 782 2014-03-28 782 2014-03-28 782 2014-03-28
3 5 22 2014-02-28 177 2014-02-28 175 2014-02-28 178 2014-02-28 177 ... 178 2014-02-28 175 2014-02-28 177 2014-02-28 177 2014-02-28 178 2014-02-28
4 6 570 2014-03-18 21 2014-03-18 570 2014-03-18 21 2014-03-18 21 ... 178 2014-02-28 175 2014-02-28 177 2014-02-28 177 2014-02-28 178 2014-02-28
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
82792 82793 812 2014-10-02 1039 2014-10-02 676 2014-10-02 3 2014-05-27 167 ... 45064 2014-05-27 45065 2014-05-27 384 2014-05-27 23 2014-05-27 3346 2014-05-27
82793 82794 300 2014-05-26 302 2014-05-26 302 2014-05-26 300 2014-05-26 300 ... 1222 2014-05-26 302 2014-05-26 1218 2014-05-26 1221 2014-05-26 1216 2014-05-26
82794 82795 29 2014-05-02 33 2014-05-02 35 2014-05-02 22 2014-05-02 37 ... 6779 2014-05-02 30 2014-05-02 21 2014-05-02 23 2014-05-02 6780 2014-05-02
82795 82796 5828 2014-05-03 23 2014-05-03 21 2014-05-03 804 2014-05-03 21 ... 3350 2014-05-03 23 2014-05-03 894 2014-05-03 21 2014-05-03 961 2014-05-03
82796 82797 21 2014-11-02 1098 2014-11-02 1098 2014-11-02 1098 2014-11-02 1098 ... 1098 2014-11-02 1098 2014-11-02 1098 2014-11-02 1098 2014-11-02 1098 2014-11-02

336357 rows × 21 columns

убираем '-' из данных в колонках 'time'

In [ ]:
full_df['time1'] = full_df['time1'].str.replace('-', ' ')
full_df['time2'] = full_df['time2'].str.replace('-', ' ')
full_df['time3'] = full_df['time3'].str.replace('-', ' ')
full_df['time4'] = full_df['time4'].str.replace('-', ' ')
full_df['time5'] = full_df['time5'].str.replace('-', ' ')
full_df['time6'] = full_df['time6'].str.replace('-', ' ')
full_df['time7'] = full_df['time7'].str.replace('-', ' ')
full_df['time8'] = full_df['time8'].str.replace('-', ' ')
full_df['time9'] = full_df['time9'].str.replace('-', ' ')
full_df['time10'] = full_df['time10'].str.replace('-', ' ')
full_df
Out[ ]:
session_id site1 time1 site2 time2 site3 time3 site4 time4 site5 ... site6 time6 site7 time7 site8 time8 site9 time9 site10 time10
0 2 890 2014 02 22 941 2014 02 22 3847 2014 02 22 941 2014 02 22 942 ... 3846 2014 02 22 3847 2014 02 22 3846 2014 02 22 1516 2014 02 22 1518 2014 02 22
1 3 14769 2013 12 16 39 2013 12 16 14768 2013 12 16 14769 2013 12 16 37 ... 39 2013 12 16 14768 2013 12 16 14768 2013 12 16 14768 2013 12 16 14768 2013 12 16
2 4 782 2014 03 28 782 2014 03 28 782 2014 03 28 782 2014 03 28 782 ... 782 2014 03 28 782 2014 03 28 782 2014 03 28 782 2014 03 28 782 2014 03 28
3 5 22 2014 02 28 177 2014 02 28 175 2014 02 28 178 2014 02 28 177 ... 178 2014 02 28 175 2014 02 28 177 2014 02 28 177 2014 02 28 178 2014 02 28
4 6 570 2014 03 18 21 2014 03 18 570 2014 03 18 21 2014 03 18 21 ... 178 2014 02 28 175 2014 02 28 177 2014 02 28 177 2014 02 28 178 2014 02 28
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
82792 82793 812 2014 10 02 1039 2014 10 02 676 2014 10 02 3 2014 05 27 167 ... 45064 2014 05 27 45065 2014 05 27 384 2014 05 27 23 2014 05 27 3346 2014 05 27
82793 82794 300 2014 05 26 302 2014 05 26 302 2014 05 26 300 2014 05 26 300 ... 1222 2014 05 26 302 2014 05 26 1218 2014 05 26 1221 2014 05 26 1216 2014 05 26
82794 82795 29 2014 05 02 33 2014 05 02 35 2014 05 02 22 2014 05 02 37 ... 6779 2014 05 02 30 2014 05 02 21 2014 05 02 23 2014 05 02 6780 2014 05 02
82795 82796 5828 2014 05 03 23 2014 05 03 21 2014 05 03 804 2014 05 03 21 ... 3350 2014 05 03 23 2014 05 03 894 2014 05 03 21 2014 05 03 961 2014 05 03
82796 82797 21 2014 11 02 1098 2014 11 02 1098 2014 11 02 1098 2014 11 02 1098 ... 1098 2014 11 02 1098 2014 11 02 1098 2014 11 02 1098 2014 11 02 1098 2014 11 02

336357 rows × 21 columns

разделяем данные на несколько колонок

In [ ]:
full_df['timef1'] = full_df['time1'].apply(lambda x: x.split()[0])
full_df['timea1'] = full_df['time1'].apply(lambda x: x.split()[1])
full_df['timef2'] = full_df['time2'].apply(lambda x: x.split()[0])
full_df['timea2'] = full_df['time2'].apply(lambda x: x.split()[1])
full_df['timef3'] = full_df['time3'].apply(lambda x: x.split()[0])
full_df['timea3'] = full_df['time3'].apply(lambda x: x.split()[1])
full_df['timef4'] = full_df['time4'].apply(lambda x: x.split()[0])
full_df['timea4'] = full_df['time4'].apply(lambda x: x.split()[1])
full_df['timef5'] = full_df['time5'].apply(lambda x: x.split()[0])
full_df['timea5'] = full_df['time5'].apply(lambda x: x.split()[1])
full_df['timef6'] = full_df['time6'].apply(lambda x: x.split()[0])
full_df['timea6'] = full_df['time6'].apply(lambda x: x.split()[1])
full_df['timef7'] = full_df['time7'].apply(lambda x: x.split()[0])
full_df['timea7'] = full_df['time7'].apply(lambda x: x.split()[1])
full_df['timef8'] = full_df['time8'].apply(lambda x: x.split()[0])
full_df['timea8'] = full_df['time8'].apply(lambda x: x.split()[1])
full_df['timef9'] = full_df['time9'].apply(lambda x: x.split()[0])
full_df['timea9'] = full_df['time9'].apply(lambda x: x.split()[1])
full_df['timef10'] = full_df['time10'].apply(lambda x: x.split()[0])
full_df['timea10'] = full_df['time10'].apply(lambda x: x.split()[1])


full_df
Out[ ]:
session_id site1 time1 site2 time2 site3 time3 site4 time4 site5 ... timef6 timea6 timef7 timea7 timef8 timea8 timef9 timea9 timef10 timea10
0 2 890 2014 02 22 941 2014 02 22 3847 2014 02 22 941 2014 02 22 942 ... 2014 02 2014 02 2014 02 2014 02 2014 02
1 3 14769 2013 12 16 39 2013 12 16 14768 2013 12 16 14769 2013 12 16 37 ... 2013 12 2013 12 2013 12 2013 12 2013 12
2 4 782 2014 03 28 782 2014 03 28 782 2014 03 28 782 2014 03 28 782 ... 2014 03 2014 03 2014 03 2014 03 2014 03
3 5 22 2014 02 28 177 2014 02 28 175 2014 02 28 178 2014 02 28 177 ... 2014 02 2014 02 2014 02 2014 02 2014 02
4 6 570 2014 03 18 21 2014 03 18 570 2014 03 18 21 2014 03 18 21 ... 2014 02 2014 02 2014 02 2014 02 2014 02
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
82792 82793 812 2014 10 02 1039 2014 10 02 676 2014 10 02 3 2014 05 27 167 ... 2014 05 2014 05 2014 05 2014 05 2014 05
82793 82794 300 2014 05 26 302 2014 05 26 302 2014 05 26 300 2014 05 26 300 ... 2014 05 2014 05 2014 05 2014 05 2014 05
82794 82795 29 2014 05 02 33 2014 05 02 35 2014 05 02 22 2014 05 02 37 ... 2014 05 2014 05 2014 05 2014 05 2014 05
82795 82796 5828 2014 05 03 23 2014 05 03 21 2014 05 03 804 2014 05 03 21 ... 2014 05 2014 05 2014 05 2014 05 2014 05
82796 82797 21 2014 11 02 1098 2014 11 02 1098 2014 11 02 1098 2014 11 02 1098 ... 2014 11 2014 11 2014 11 2014 11 2014 11

336357 rows × 41 columns

удаляем старые колонки

In [ ]:
full_df = full_df.drop(['time1', 'time2', 'time3', 'time4', 'time5', 'time6', 'time7', 'time8', 'time9', 'time10'], axis= 1 ,)
full_df
Out[ ]:
session_id site1 site2 site3 site4 site5 site6 site7 site8 site9 ... timef6 timea6 timef7 timea7 timef8 timea8 timef9 timea9 timef10 timea10
0 2 890 941 3847 941 942 3846 3847 3846 1516 ... 2014 02 2014 02 2014 02 2014 02 2014 02
1 3 14769 39 14768 14769 37 39 14768 14768 14768 ... 2013 12 2013 12 2013 12 2013 12 2013 12
2 4 782 782 782 782 782 782 782 782 782 ... 2014 03 2014 03 2014 03 2014 03 2014 03
3 5 22 177 175 178 177 178 175 177 177 ... 2014 02 2014 02 2014 02 2014 02 2014 02
4 6 570 21 570 21 21 178 175 177 177 ... 2014 02 2014 02 2014 02 2014 02 2014 02
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
82792 82793 812 1039 676 3 167 45064 45065 384 23 ... 2014 05 2014 05 2014 05 2014 05 2014 05
82793 82794 300 302 302 300 300 1222 302 1218 1221 ... 2014 05 2014 05 2014 05 2014 05 2014 05
82794 82795 29 33 35 22 37 6779 30 21 23 ... 2014 05 2014 05 2014 05 2014 05 2014 05
82795 82796 5828 23 21 804 21 3350 23 894 21 ... 2014 05 2014 05 2014 05 2014 05 2014 05
82796 82797 21 1098 1098 1098 1098 1098 1098 1098 1098 ... 2014 11 2014 11 2014 11 2014 11 2014 11

336357 rows × 31 columns

соеденяем колонки

In [ ]:
full_df['time_s1'] = full_df['timef1'] + full_df['timea1']
full_df['time_s2'] = full_df['timef2'] + full_df['timea2']
full_df['time_s3'] = full_df['timef3'] + full_df['timea3']
full_df['time_s4'] = full_df['timef4'] + full_df['timea4']
full_df['time_s5'] = full_df['timef5'] + full_df['timea5']
full_df['time_s6'] = full_df['timef6'] + full_df['timea6']
full_df['time_s7'] = full_df['timef7'] + full_df['timea7']
full_df['time_s8'] = full_df['timef8'] + full_df['timea8']
full_df['time_s9'] = full_df['timef9'] + full_df['timea9']
full_df['time_s10'] = full_df['timef10'] + full_df['timea10']
full_df
Out[ ]:
session_id site1 site2 site3 site4 site5 site6 site7 site8 site9 ... time_s1 time_s2 time_s3 time_s4 time_s5 time_s6 time_s7 time_s8 time_s9 time_s10
0 2 890 941 3847 941 942 3846 3847 3846 1516 ... 201402 201402 201402 201402 201402 201402 201402 201402 201402 201402
1 3 14769 39 14768 14769 37 39 14768 14768 14768 ... 201312 201312 201312 201312 201312 201312 201312 201312 201312 201312
2 4 782 782 782 782 782 782 782 782 782 ... 201403 201403 201403 201403 201403 201403 201403 201403 201403 201403
3 5 22 177 175 178 177 178 175 177 177 ... 201402 201402 201402 201402 201402 201402 201402 201402 201402 201402
4 6 570 21 570 21 21 178 175 177 177 ... 201403 201403 201403 201403 201403 201402 201402 201402 201402 201402
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
82792 82793 812 1039 676 3 167 45064 45065 384 23 ... 201410 201410 201410 201405 201405 201405 201405 201405 201405 201405
82793 82794 300 302 302 300 300 1222 302 1218 1221 ... 201405 201405 201405 201405 201405 201405 201405 201405 201405 201405
82794 82795 29 33 35 22 37 6779 30 21 23 ... 201405 201405 201405 201405 201405 201405 201405 201405 201405 201405
82795 82796 5828 23 21 804 21 3350 23 894 21 ... 201405 201405 201405 201405 201405 201405 201405 201405 201405 201405
82796 82797 21 1098 1098 1098 1098 1098 1098 1098 1098 ... 201411 201411 201411 201411 201411 201411 201411 201411 201411 201411

336357 rows × 41 columns

удаляем старые колонки

In [ ]:
full_df = full_df.drop(['timef1', 'timef2', 'timef3', 'timef4', 'timef5', 'timef6', 'timef7', 'timef8', 'timef9', 'timef10','timea1', 'timea2', 'timea3', 'timea4', 'timea5', 'timea6', 'timea7', 'timea8', 'timea9', 'timea10'], axis= 1 ,)
full_df
Out[ ]:
session_id site1 site2 site3 site4 site5 site6 site7 site8 site9 ... time_s1 time_s2 time_s3 time_s4 time_s5 time_s6 time_s7 time_s8 time_s9 time_s10
0 2 890 941 3847 941 942 3846 3847 3846 1516 ... 201402 201402 201402 201402 201402 201402 201402 201402 201402 201402
1 3 14769 39 14768 14769 37 39 14768 14768 14768 ... 201312 201312 201312 201312 201312 201312 201312 201312 201312 201312
2 4 782 782 782 782 782 782 782 782 782 ... 201403 201403 201403 201403 201403 201403 201403 201403 201403 201403
3 5 22 177 175 178 177 178 175 177 177 ... 201402 201402 201402 201402 201402 201402 201402 201402 201402 201402
4 6 570 21 570 21 21 178 175 177 177 ... 201403 201403 201403 201403 201403 201402 201402 201402 201402 201402
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
82792 82793 812 1039 676 3 167 45064 45065 384 23 ... 201410 201410 201410 201405 201405 201405 201405 201405 201405 201405
82793 82794 300 302 302 300 300 1222 302 1218 1221 ... 201405 201405 201405 201405 201405 201405 201405 201405 201405 201405
82794 82795 29 33 35 22 37 6779 30 21 23 ... 201405 201405 201405 201405 201405 201405 201405 201405 201405 201405
82795 82796 5828 23 21 804 21 3350 23 894 21 ... 201405 201405 201405 201405 201405 201405 201405 201405 201405 201405
82796 82797 21 1098 1098 1098 1098 1098 1098 1098 1098 ... 201411 201411 201411 201411 201411 201411 201411 201411 201411 201411

336357 rows × 21 columns

и смотрим на результат

In [ ]:
full_df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 336357 entries, 0 to 82796
Data columns (total 21 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   session_id  336357 non-null  int64 
 1   site1       336357 non-null  int64 
 2   site2       336357 non-null  int64 
 3   site3       336357 non-null  int64 
 4   site4       336357 non-null  int64 
 5   site5       336357 non-null  int64 
 6   site6       336357 non-null  int64 
 7   site7       336357 non-null  int64 
 8   site8       336357 non-null  int64 
 9   site9       336357 non-null  int64 
 10  site10      336357 non-null  int64 
 11  time_s1     336357 non-null  object
 12  time_s2     336357 non-null  object
 13  time_s3     336357 non-null  object
 14  time_s4     336357 non-null  object
 15  time_s5     336357 non-null  object
 16  time_s6     336357 non-null  object
 17  time_s7     336357 non-null  object
 18  time_s8     336357 non-null  object
 19  time_s9     336357 non-null  object
 20  time_s10    336357 non-null  object
dtypes: int64(11), object(10)
memory usage: 56.5+ MB

данный признак был построен для учитывания помесячного линейного тренда за весь период предоставленных данных

2.3 Подготовка отчета¶

2.1 Визуальный анализ данных

  • В результате этого задания были в визуализированны зависимости атрибутов в наборе данных. Визуализация показала влияние атрибутов на целевую переменную, и данные были интерпретированны. В результате первой визуализации были показаны зависимости данных, а в результате второй визуализации мы увидели разброс данных

2.2 Конструирование признаков (Feature Engineering)

  • В результате этого задания был создан признак, который будет представлять собой число вида ГГГГММ от той даты, когда проходила сессия. Таким образом, мы будем учитывать помесячный линейный тренд за весь период предоставленных данных. Были добавлены новые признаки, которые на мой взгляд позволят улучшить качество выбранной модели. Была написанна функция для создания новых признаков. Были описаны приемы генерации новых данных и результаты. В результате этого были получены более 10 новых признаков времени Признак time* был построен для учитывания помесячного линейного тренда за весь период предоставленных данных

2.3 Подготовка отчета

  • В результате этого задания был создан отчет который показывает всю проделанную работу и результаты.